Assignment 4: Street Networks & Web Scraping¶

Part 1: Visualizing crash data in Philadelphia

In this section, you will use osmnx to analyze the crash incidence in Center City.

Part 2: Scraping Craigslist

In this section, you will use Selenium and BeautifulSoup to scrape data for hundreds of apartments from Philadelphia's Craigslist portal.

Part 1: Visualizing crash data in Philadelphia¶

1.1 Load the geometry for the region being analyzed¶

We'll analyze crashes in the "Central" planning district in Philadelphia, a rough approximation for Center City. Planning districts can be loaded from Open Data Philly. Read the data into a GeoDataFrame using the following link:

http://data.phl.opendata.arcgis.com/datasets/0960ea0f38f44146bb562f2b212075aa_0.geojson

Select the "Central" district and extract the geometry polygon for only this district. After this part, you should have a polygon variable of type shapely.geometry.polygon.Polygon.

In [2]:
import osmnx as ox
import geopandas as gpd
In [3]:
planning_district = gpd.read_file("http://data.phl.opendata.arcgis.com/datasets/0960ea0f38f44146bb562f2b212075aa_0.geojson")
In [4]:
planning_district.head()
Out[4]:
OBJECTID_1 OBJECTID DIST_NAME ABBREV Shape__Area Shape__Length PlanningDist DaytimePop geometry
0 1 14 River Wards RW 2.107270e+08 66931.595020 NaN NaN POLYGON ((-75.09798 40.00496, -75.09687 40.005...
1 2 3 North Delaware NDEL 2.700915e+08 89213.074378 NaN NaN POLYGON ((-74.98159 40.05363, -74.98139 40.053...
2 3 0 Lower Far Northeast LFNE 3.068529e+08 92703.285159 NaN NaN POLYGON ((-74.96443 40.11728, -74.96434 40.117...
3 4 9 Central CTR 1.782880e+08 71405.143450 NaN NaN POLYGON ((-75.14791 39.96733, -75.14715 39.967...
4 5 10 University Southwest USW 1.296468e+08 65267.676141 NaN NaN POLYGON ((-75.18742 39.96338, -75.18644 39.963...
In [5]:
central_district = planning_district.query("DIST_NAME == 'Central'")
In [6]:
ax = ox.project_gdf(central_district).plot(fc="lightblue", ec="gray")
ax.set_axis_off()
No description has been provided for this image
In [7]:
center_city_outline = central_district.squeeze().geometry

center_city_outline
Out[7]:
No description has been provided for this image
In [8]:
type(center_city_outline)
Out[8]:
shapely.geometry.polygon.Polygon

1.2 Get the street network graph¶

Use OSMnx to create a network graph (of type 'drive') from your polygon boundary in 1.1.

In [9]:
# Get the graph
G_cc = ox.graph_from_polygon(center_city_outline, network_type="drive")
/Users/hangzhao/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/shapely/constructive.py:181: RuntimeWarning: invalid value encountered in buffer
  return lib.buffer(
/Users/hangzhao/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/shapely/predicates.py:798: RuntimeWarning: invalid value encountered in intersects
  return lib.intersects(a, b, **kwargs)
/Users/hangzhao/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/shapely/set_operations.py:340: RuntimeWarning: invalid value encountered in union
  return lib.union(a, b, **kwargs)
/Users/hangzhao/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/shapely/predicates.py:798: RuntimeWarning: invalid value encountered in intersects
  return lib.intersects(a, b, **kwargs)
/Users/hangzhao/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/shapely/set_operations.py:340: RuntimeWarning: invalid value encountered in union
  return lib.union(a, b, **kwargs)
In [10]:
# Viola!
ox.plot_graph(ox.project_graph(G_cc), node_size=0);
No description has been provided for this image

1.3 Convert your network graph edges to a GeoDataFrame¶

Use OSMnx to create a GeoDataFrame of the network edges in the graph object from part 1.2. The GeoDataFrame should contain the edges but not the nodes from the network.

In [11]:
type(G_cc)
Out[11]:
networkx.classes.multidigraph.MultiDiGraph
In [12]:
# only get the edges
cc_edges = ox.graph_to_gdfs(G_cc, edges=True, nodes=False)
In [35]:
type(cc_edges)
Out[35]:
geopandas.geodataframe.GeoDataFrame

1.4 Load PennDOT crash data¶

Data for crashes (of all types) for 2020, 2021, and 2022 in Philadelphia County is available at the following path:

./data/CRASH_PHILADELPHIA_XXXX.csv

You should see three separate files in the data/ folder. Use pandas to read each of the CSV files, and combine them into a single dataframe using pd.concat().

The data was downloaded for Philadelphia County from here.

In [14]:
import pandas as pd
In [18]:
crash_2020 = pd.read_csv("./data/CRASH_PHILADELPHIA_2020.csv")
crash_2021 = pd.read_csv("./data/CRASH_PHILADELPHIA_2021.csv")
crash_2022 = pd.read_csv("./data/CRASH_PHILADELPHIA_2022.csv")
crash_all = pd.concat([crash_2020, crash_2021, crash_2022])

1.5 Convert the crash data to a GeoDataFrame¶

You will need to use the DEC_LAT and DEC_LONG columns for latitude and longitude.

The full data dictionary for the data is available here

In [23]:
crash_all = crash_all.dropna(subset=["DEC_LAT", "DEC_LONG"])
In [26]:
g_crash = gpd.GeoDataFrame(
    crash_all,  # The pandas DataFrame
    geometry=gpd.points_from_xy(crash_all["DEC_LONG"], crash_all["DEC_LAT"]), # The geometry!
    crs="EPSG:4326", # The CRS 
)
type(g_crash)
Out[26]:
geopandas.geodataframe.GeoDataFrame

1.6 Trim the crash data to Center City¶

  1. Get the boundary of the edges data frame (from part 1.3). Accessing the .geometry.unary_union.convex_hull property will give you a nice outer boundary region.
  2. Trim the crashes using the within() function of the crash GeoDataFrame to find which crashes are within the boundary.

There should be about 3,750 crashes within the Central district.

In [29]:
boundary = cc_edges.geometry.unary_union.convex_hull
/Users/hangzhao/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/shapely/set_operations.py:426: RuntimeWarning: invalid value encountered in unary_union
  return lib.unary_union(collections, **kwargs)
In [36]:
within = g_crash[g_crash.within(boundary) == True]
In [41]:
len(within)
Out[41]:
3751

1.7 Re-project our data into an approriate CRS¶

We'll need to find the nearest edge (street) in our graph for each crash. To do this, osmnx will calculate the distance from each crash to the graph edges. For this calculation to be accurate, we need to convert from latitude/longitude

We'll convert the local state plane CRS for Philadelphia, EPSG=2272

Two steps:¶

  1. Project the graph object (G) using the ox.project_graph. Run ox.project_graph? to see the documentation for how to convert to a specific CRS.
  2. Project the crash data using the .to_crs() function.
In [42]:
reproject = ox.project_graph(G_cc, to_crs=2272)
In [38]:
ox.project_graph?
Signature: ox.project_graph(G, to_crs=None)
Docstring:
Reproject a graph from its current CRS to another.

If `to_crs` is None, project the graph to the UTM CRS for the UTM zone in
which the graph's centroid lies. Otherwise, project the graph to the CRS
defined by `to_crs`.

Parameters
----------
G : networkx.MultiDiGraph
    the graph to be projected
to_crs : string or pyproj.CRS
    if None, project graph to UTM zone in which graph centroid lies,
    otherwise project graph to this CRS

Returns
-------
G_proj : networkx.MultiDiGraph
    the projected graph
File:      ~/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/osmnx/projection.py
Type:      function

1.8 Find the nearest edge for each crash¶

See: ox.distance.nearest_edges(). It takes three arguments:

  • the network graph
  • the longitude of your crash data (the x attribute of the geometry column)
  • the latitude of your crash data (the y attribute of the geometry column)

You will get a numpy array with 3 columns that represent (u, v, key) where each u and v are the node IDs that the edge links together. We will ignore the key value for our analysis.

In [57]:
within_x = within.geometry.x
within_y = within.geometry.y
In [58]:
node = ox.distance.nearest_edges(G_cc, within_x, within_y) 
In [46]:
ox.distance.nearest_edges?
Signature: ox.distance.nearest_edges(G, X, Y, interpolate=None, return_dist=False)
Docstring:
Find the nearest edge to a point or to each of several points.

If `X` and `Y` are single coordinate values, this will return the nearest
edge to that point. If `X` and `Y` are lists of coordinate values, this
will return the nearest edge to each point. This function uses an R-tree
spatial index and minimizes the euclidean distance from each point to the
possible matches. For accurate results, use a projected graph and points.

Parameters
----------
G : networkx.MultiDiGraph
    graph in which to find nearest edges
X : float or list
    points' x (longitude) coordinates, in same CRS/units as graph and
    containing no nulls
Y : float or list
    points' y (latitude) coordinates, in same CRS/units as graph and
    containing no nulls
interpolate : float
    deprecated, do not use
return_dist : bool
    optionally also return distance between points and nearest edges

Returns
-------
ne or (ne, dist) : tuple or list
    nearest edges as (u, v, key) or optionally a tuple where `dist`
    contains distances between the points and their nearest edges
File:      ~/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/osmnx/distance.py
Type:      function

1.9 Calculate the total number of crashes per street¶

  1. Make a DataFrame from your data from part 1.7 with three columns, u, v, and key (we will only use the u and v columns)
  2. Group by u and v and calculate the size
  3. Reset the index and name your size() column as crash_count

After this step you should have a DataFrame with three columns: u, v, and crash_count.

In [70]:
# convert graph to geodataframe
reproject_gdf = ox.graph_to_gdfs(reproject, edges=True, nodes=False)
In [87]:
reproject_groupby = reproject_gdf.groupby(["u", "v"], as_index=False).size().rename(columns={"size": "crash_count"})
len(reproject_groupby)
Out[87]:
3883

1.10 Merge your edges GeoDataFrame and crash count DataFrame¶

You can use pandas to merge them on the u and v columns. This will associate the total crash count with each edge in the street network.

Tips:

  • Use a left merge where the first argument of the merge is the edges GeoDataFrame. This ensures no edges are removed during the merge.
  • Use the fillna(0) function to fill in missing crash count values with zero.
In [171]:
merged_data = pd.merge(cc_edges, reproject_groupby, how="left", on=["u", "v"]).fillna(0)
#len(merged_data)
type(merged_data)
Out[171]:
geopandas.geodataframe.GeoDataFrame

1.11 Calculate a "Crash Index"¶

Let's calculate a "crash index" that provides a normalized measure of the crash frequency per street. To do this, we'll need to:

  1. Calculate the total crash count divided by the street length, using the length column
  2. Perform a log transformation of the crash/length variable — use numpy's log10() function
  3. Normalize the index from 0 to 1 (see the lecture notes for an example of this transformation)

Note: since the crash index involves a log transformation, you should only calculate the index for streets where the crash count is greater than zero.

After this step, you should have a new column in the data frame from 1.9 that includes a column called part 1.9.

In [103]:
import numpy as np
In [172]:
crash_by_length = merged_data['crash_count'].count()/merged_data['length']
logged = np.log10(crash_by_length)
# normalized = log_index/crash_index.max()
# merged_data["part 1.9"] = normalized
# normalized
merged_data["part 1.9"] = (logged-np.min(logged))/(np.max(logged)-np.min(logged))
merged_data["part 1.9"].mean()
Out[172]:
0.464853601936429

1.12 Plot a histogram of the crash index values¶

Use matplotlib's hist() function to plot the crash index values from the previous step.

You should see that the index values are Gaussian-distributed, providing justification for why we log-transformed!

In [131]:
import matplotlib.pyplot as plt
In [173]:
plt.hist(normalized)
Out[173]:
(array([   4.,   27.,   88., 1099., 1366.,  906.,  238.,  144.,   19.,
           5.]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 <BarContainer object of 10 artists>)
No description has been provided for this image

1.13 Plot an interactive map of the street networks, colored by the crash index¶

You can use GeoPandas to make an interactive Folium map, coloring the streets by the crash index column.

Tip: if you use the viridis color map, try using a "dark" tile set for better constrast of the colors.

In [166]:
import folium
In [188]:
interactive_map = merged_data.explore(column="part 1.9", tiles="CartoDB dark_matter", scheme="Quantiles", k=5, cmap="magma")

interactive_map
Out[188]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [ ]:
 

Part 2: Scraping Craigslist¶

In this part, we'll be extracting information on apartments from Craigslist search results. You'll be using Selenium and BeautifulSoup to extract the relevant information from the HTML text.

For reference on CSS selectors, please see the notes from Week 6.

Primer: the Craigslist website URL¶

We'll start with the Philadelphia region. First we need to figure out how to submit a query to Craigslist. As with many websites, one way you can do this is simply by constructing the proper URL and sending it to Craigslist.

https://philadelphia.craigslist.org/search/apa?min_price=1&min_bedrooms=1&minSqft=1#search=1~gallery~0~0

There are three components to this URL.

  1. The base URL: http://philadelphia.craigslist.org/search/apa

  2. The user's search parameters: ?min_price=1&min_bedrooms=1&minSqft=1

We will send nonzero defaults for some parameters (bedrooms, size, price) in order to exclude results that have empty values for these parameters.

  1. The URL hash: #search=1~gallery~0~0

As we will see later, this part will be important because it contains the search page result number.

The Craigslist website requires Javascript, so we'll need to use Selenium to load the page, and then use BeautifulSoup to extract the information we want.

2.1 Initialize a selenium driver and open Craigslist¶

As discussed in lecture, you can use Chrome, Firefox, or Edge as your selenium driver. In this part, you should do two things:

  1. Initialize the selenium driver
  2. Use the driver.get() function to open the following URL:

https://philadelphia.craigslist.org/search/apa?min_price=1&min_bedrooms=1&minSqft=1#search=1~gallery~0~0

This will give you the search results for 1-bedroom apartments in Philadelphia.

In [1]:
# Import the webdriver from selenium
from selenium import webdriver
In [2]:
# UNCOMMENT BELOW TO USE CHROME

driver = webdriver.Chrome()
In [3]:
url = "https://philadelphia.craigslist.org/search/apa?min_price=1&min_bedrooms=1&minSqft=1#search=1gallery0~0"
driver.get(url)

2.2 Initialize your "soup"¶

Once selenium has the page open, we can get the page source from the driver and use BeautifulSoup to parse it. In this part, initialize a BeautifulSoup object with the driver's page source

In [3]:
# Start with the usual imports
# We'll use these throughout
import pandas as pd
from bs4 import BeautifulSoup
import requests
In [6]:
# Initialize the soup for this page

soup1 = BeautifulSoup(driver.page_source, "html.parser")

2.3 Parsing the HTML¶

Now that we have our "soup" object, we can use BeautifulSoup to extract out the elements we need:

  • Use the Web Inspector to identify the HTML element that holds the information on each apartment listing.
  • Use BeautifulSoup to extract these elements from the HTML.

At the end of this part, you should have a list of 120 elements, where each element is the listing for a specific apartment on the search page.

In [34]:
# Use the # for ID selector
selector = ".cl-search-result"

rows = soup1.select(selector)
len(rows)
Out[34]:
120

2.4 Find the relevant pieces of information¶

We will now focus on the first element in the list of 120 apartments. Use the prettify() function to print out the HTML for this first element.

From this HTML, identify the HTML elements that hold:

  • The apartment price
  • The number of bedrooms
  • The square footage
  • The apartment title

For the first apartment, print out each of these pieces of information, using BeautifulSoup to select the proper elements.

In [36]:
row1 = rows[0]
In [37]:
print(row1.prettify())
<li class="cl-search-result cl-search-view-mode-gallery" data-pid="7682740541" title="Queen Village Bi-Level Charming Best Location">
 <div class="gallery-card">
  <div class="cl-gallery">
   <div class="gallery-inner">
    <a class="main" href="https://philadelphia.craigslist.org/apa/d/philadelphia-queen-village-bi-level/7682740541.html">
     <div class="swipe" style="visibility: visible;">
      <div class="swipe-wrap" style="width: 5056px;">
       <div data-index="0" style="width: 316px; left: 0px; transition-duration: 0ms; transform: translateX(0px);">
        <span class="loading icom-">
        </span>
        <img alt="Queen Village Bi-Level Charming Best Location 1" src="https://images.craigslist.org/01212_bJE38ghXjLv_0gk0t2_300x300.jpg"/>
       </div>
       <div data-index="1" style="width: 316px; left: -316px; transition-duration: 0ms; transform: translateX(316px);">
       </div>
       <div data-index="2" style="width: 316px; left: -632px; transition-duration: 0ms; transform: translateX(316px);">
       </div>
       <div data-index="3" style="width: 316px; left: -948px; transition-duration: 0ms; transform: translateX(316px);">
       </div>
       <div data-index="4" style="width: 316px; left: -1264px; transition-duration: 0ms; transform: translateX(316px);">
       </div>
       <div data-index="5" style="width: 316px; left: -1580px; transition-duration: 0ms; transform: translateX(316px);">
       </div>
       <div data-index="6" style="width: 316px; left: -1896px; transition-duration: 0ms; transform: translateX(316px);">
       </div>
       <div data-index="7" style="width: 316px; left: -2212px; transition-duration: 0ms; transform: translateX(-316px);">
       </div>
      </div>
     </div>
     <div class="slider-back-arrow icom-">
     </div>
     <div class="slider-forward-arrow icom-">
     </div>
    </a>
   </div>
   <div class="dots">
    <span class="dot selected">
     •
    </span>
    <span class="dot">
     •
    </span>
    <span class="dot">
     •
    </span>
    <span class="dot">
     •
    </span>
    <span class="dot">
     •
    </span>
    <span class="dot">
     •
    </span>
    <span class="dot">
     •
    </span>
    <span class="dot">
     •
    </span>
   </div>
  </div>
  <a class="cl-app-anchor text-only posting-title" href="https://philadelphia.craigslist.org/apa/d/philadelphia-queen-village-bi-level/7682740541.html" tabindex="0">
   <span class="label">
    Queen Village Bi-Level Charming Best Location
   </span>
  </a>
  <div class="meta">
   17 mins ago
   <span class="separator">
    ·
   </span>
   <span class="housing-meta">
    <span class="post-bedrooms">
     2br
    </span>
    <span class="post-sqft">
     1000ft
     <span class="exponent">
      2
     </span>
    </span>
   </span>
   <span class="separator">
    ·
   </span>
   Queen Village
  </div>
  <span class="priceinfo">
   $1,950
  </span>
  <button class="bd-button cl-favorite-button icon-only" tabindex="0" title="add to favorites list" type="button">
   <span class="icon icom-">
   </span>
   <span class="label">
   </span>
  </button>
  <button class="bd-button cl-banish-button icon-only" tabindex="0" title="hide posting" type="button">
   <span class="icon icom-">
   </span>
   <span class="label">
    hide
   </span>
  </button>
 </div>
</li>

In [42]:
# Use the . to specify class name
price = row1.select_one(".priceinfo").text
print("the apartment price is", price)

# The number of bedroom
nbed = row1.select_one(".post-bedrooms").text
print("the number of bedroom is", nbed)

# The square footage
square = row1.select_one(".post-sqft").text
print("the square footage is", square)

# apartment title
title = row1.select_one(".label").text
print("the apartment title is", title)
the apartment price is $1,950
the number of bedroom is 2br
the square footage is 1000ft2
the apartment title is Queen Village Bi-Level Charming Best Location

2.5 Functions to format the results¶

In this section, you'll create functions that take in the raw string elements for price, size, and number of bedrooms and returns them formatted as numbers.

I've started the functions to format the values. You should finish theses functions in this section.

Hints

  • You can use string formatting functions like string.replace() and string.strip()
  • The int() and float() functions can convert strings to numbers
In [4]:
def format_bedrooms(bedrooms_string):
    # Format the bedrooms string and return an int
    # 
    # This will involve using the string.replace() function to 
    # remove unwanted characters
    x = bedrooms_string.replace("br", "")
    
    return int(x)
In [5]:
def format_size(size_string):
    # Format the size string and return a float
    # 
    # This will involve using the string.replace() function to 
    # remove unwanted characters
    y = size_string.replace("ft2", "")
    
    return float(y)
In [6]:
def format_price(price_string):
    # Format the price string and return a float
    # 
    # This will involve using the string.strip() function to 
    # remove unwanted characters
    z = price_string.strip("$").replace(",","")
    
    return float(z)

2.6 Putting it all together¶

In this part, you'll complete the code block below using results from previous parts. The code will loop over 5 pages of search results and scrape data for 600 apartments.

We can get a specific page by changing the search=PAGE part of the URL hash. For example, to get page 2 instead of page 1, we will navigate to:

https://philadelphia.craigslist.org/search/apa?min_price=1&min_bedrooms=1&minSqft=1#search=2~gallery~0~0

In the code below, the outer for loop will loop over 5 pages of search results. The inner for loop will loop over the 120 apartments listed on each search page.

Fill in the missing pieces of the inner loop using the code from the previous section. We will be able to extract out the relevant pieces of info for each apartment.

After filling in the missing pieces and executing the code cell, you should have a Data Frame called results that holds the data for 600 apartment listings.

Notes¶

Be careful if you try to scrape more listings. Craigslist will temporarily ban your IP address (for a very short time) if you scrape too much at once. I've added a sleep() function to the for loop to wait 30 seconds between scraping requests.

If the for loop gets stuck at the "Processing page X..." step for more than a minute or so, your IP address is probably banned temporarily, and you'll have to wait a few minutes before trying again.

In [7]:
from time import sleep
In [8]:
results = []

# search in batches of 120 for 5 pages
# NOTE: you will get temporarily banned if running more than ~5 pages or so
# the API limits are more leninient during off-peak times, and you can try
# experimenting with more pages
max_pages = 5

# The base URL we will be using
base_url = "https://philadelphia.craigslist.org/search/apa?min_price=1&min_bedrooms=1&minSqft=1"

# loop over each page of search results
for page_num in range(0, max_pages):
    print(f"Processing page {page_num}...")

    # Update the URL hash for this page number and make the combined URL
    url_hash = f"#search=1~gallery~{page_num}~0"
    url = base_url + url_hash

    # Go to the driver and wait for 5 seconds
    driver.get(url)
    sleep(5)

    # YOUR CODE: get the list of all apartments
    # This is the same code from Part 1.2 and 1.3
    # It should be a list of 120 apartments
    soup = BeautifulSoup(driver.page_source, "html.parser")
    selector = ".cl-search-result"
    apts = soup.select(selector)
    print("Number of apartments = ", len(apts))

    # loop over each apartment in the list
    page_results = []
    for apt in apts:

        # YOUR CODE: the bedrooms string
        bedrooms = apt.select_one(".post-bedrooms").text

        # YOUR CODE: the size string
        size = apt.select_one(".post-sqft").text

        # YOUR CODE: the title string
        title = apt.select_one(".label").text

        # YOUR CODE: the price string
        price = apt.select_one(".priceinfo").text


        # Format using functions from Part 1.5
        bedrooms = format_bedrooms(bedrooms)
        size = format_size(size)
        price = format_price(price)

        # Save the result
        page_results.append([price, size, bedrooms, title])

    # Create a dataframe and save
    col_names = ["price", "size", "bedrooms", "title"]
    df = pd.DataFrame(page_results, columns=col_names)
    results.append(df)

    print("sleeping for 10 seconds between calls")
    sleep(10)

# Finally, concatenate all the results
results = pd.concat(results, axis=0).reset_index(drop=True)
Processing page 1...
Number of apartments =  120
sleeping for 10 seconds between calls
Processing page 2...
Number of apartments =  120
sleeping for 10 seconds between calls
Processing page 3...
Number of apartments =  120
sleeping for 10 seconds between calls
Processing page 4...
Number of apartments =  120
sleeping for 10 seconds between calls
Processing page 5...
Number of apartments =  120
sleeping for 10 seconds between calls
In [9]:
results.tail()
Out[9]:
price size bedrooms title
595 1550.0 515.0 1 Modern Masterpiece in the Heart of Northern Li...
596 2339.0 1157.0 2 Sugar and Spice With EVERYTHING Nice!
597 2008.0 597.0 1 1 Bed, Concierge, Patio/Balcony
598 1406.0 300.0 1 Beautifully renovated apartment with plenty of...
599 4500.0 1046.0 2 24/7 onsite concierge service, On-demand dog w...
In [10]:
results[0:10]
Out[10]:
price size bedrooms title
0 1735.0 600.0 2 NO ANNUAL RENT INCREASE & REFUNDABLE APPLICATI...
1 2500.0 1520.0 3 3 BR, 1.5 bath quaint house w/parking in prime...
2 1977.0 532.0 1 Handyman and maintenance service, On-demand ca...
3 4187.0 1215.0 2 2/bd, Dog Wash Station, Yoga Studio
4 1500.0 900.0 1 1bd Apt month free 11th St.&Pine St Philly Cen...
5 1768.0 680.0 1 Great 1 bed / 1 bath! We Love Pets! Ask about ...
6 1899.0 692.0 1 Heated 3 Season Pool, Cyber Café, Zen Courtyard
7 1295.0 650.0 1 Totally Renovated Apartment One Block from Mai...
8 1897.0 477.0 1 In Philadelphia, 1/bd, Quartz Countertops
9 1500.0 900.0 1 1bd Rm Apt w/a DECK! One month free and utilit...
In [11]:
results[120:130]
Out[11]:
price size bedrooms title
120 1735.0 600.0 2 NO ANNUAL RENT INCREASE & REFUNDABLE APPLICATI...
121 2500.0 1520.0 3 3 BR, 1.5 bath quaint house w/parking in prime...
122 1977.0 532.0 1 Handyman and maintenance service, On-demand ca...
123 4187.0 1215.0 2 2/bd, Dog Wash Station, Yoga Studio
124 1500.0 900.0 1 1bd Apt month free 11th St.&Pine St Philly Cen...
125 1768.0 680.0 1 Great 1 bed / 1 bath! We Love Pets! Ask about ...
126 1899.0 692.0 1 Heated 3 Season Pool, Cyber Café, Zen Courtyard
127 1295.0 650.0 1 Totally Renovated Apartment One Block from Mai...
128 1897.0 477.0 1 In Philadelphia, 1/bd, Quartz Countertops
129 1500.0 900.0 1 1bd Rm Apt w/a DECK! One month free and utilit...

2.7 Plotting the distribution of prices¶

Use matplotlib's hist() function to make two histograms for:

  • Apartment prices
  • Apartment prices per square foot (price / size)

Make sure to add labels to the respective axes and a title describing the plot.

In [12]:
import matplotlib.pyplot as plt
In [13]:
plt.hist(results["price"], bins=45, edgecolor='black', linewidth = 0.5, color='orange')
plt.title("apartment prices distribution in Philadelphia")
plt.xlabel("price per month")
plt.ylabel("number of apartments")
Out[13]:
Text(0, 0.5, 'number of apartments')
No description has been provided for this image
In [19]:
results.duplicated()
Out[19]:
0      False
1      False
2      False
3      False
4      False
       ...  
595     True
596     True
597     True
598     True
599     True
Length: 600, dtype: bool
In [15]:
price_size = results["price"]/results["size"]
In [16]:
plt.hist(price_size, bins=45, edgecolor='black', linewidth = 0.5, color='gray')
plt.title("apartment prices per square foot in Philadelphia")
plt.xlabel("price per square foot/month")
plt.ylabel("number of apartments")
Out[16]:
Text(0, 0.5, 'number of apartments')
No description has been provided for this image

Side note: rental prices per sq. ft. from Craigslist¶

The histogram of price per sq ft should be centered around ~1.5. Here is a plot of how Philadelphia's rents compare to the other most populous cities:

No description has been provided for this image

Source

2.8 Comparing prices for different sizes¶

Use altair to explore the relationship between price, size, and number of bedrooms. Make an interactive scatter plot of price (x-axis) vs. size (y-axis), with the points colored by the number of bedrooms.

Make sure the plot is interactive (zoom-able and pan-able) and add a tooltip with all of the columns in our scraped data frame.

With this sort of plot, you can quickly see the outlier apartments in terms of size and price.

In [17]:
import altair as alt  
In [18]:
# Step 1: Initialize the chart with the data
chart = alt.Chart(results)

# Step 2: Define what kind of marks to use
chart = chart.mark_circle(size=60)

# Step 3: Encode the visual channels
chart = chart.encode(
    x="price",
    y="size",
    color="bedrooms", 
    tooltip=["price", "size", "bedrooms", "title"],
)

# Optional: Make the chart interactive
chart.interactive()
Out[18]:
In [ ]: